NSF PAR Search | NSF Public Access Repository

CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts

Li, Jiachen; Wang, Xinyao; Zhu, Sijie; Kuo, Chia-Wen; Xu, Lu; Chen, Fan; Jain, Jitesh; Shi, Humphrey; Wen, Longyin (December 2024, NeurIPS 2024)

Recent advancements in Multimodal Large Language Models (LLMs) have focused primarily on scaling by increasing text-image pair data and enhancing LLMs to improve performance on multimodal tasks. However, these scaling approaches are computationally expensive and overlook the significance of efficiently improving model capabilities from the vision side. Inspired by the successful applications of Mixture-of-Experts (MoE) in LLMs, which improve model scalability during training while keeping inference costs similar to those of smaller models, we propose CuMo, which incorporates Co-upcycled Top-K sparsely-gated Mixture-of-Experts blocks into both the vision encoder and the MLP connector, thereby enhancing multimodal LLMs with negligible additional activated parameters during inference. CuMo first pre-trains the MLP blocks and then initializes each expert in the MoE block from the pre-trained MLP block during the visual instruction tuning stage, with auxiliary losses to ensure balanced loading of experts. CuMo outperforms state-of-the-art multimodal LLMs across various VQA and visual-instruction-following benchmarks within each model size group, all while training exclusively on open-sourced datasets.
Full Text Available
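
The upcycling and Top-K gating described in the CuMo abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the single-layer ReLU "MLP", the linear router, and all function names (`mlp_forward`, `upcycle`, `top_k_gate`, `moe_forward`) are assumptions made for illustration, and the auxiliary load-balancing losses are omitted.

```python
import copy
import math


def mlp_forward(weights, x):
    """One linear layer plus ReLU: a stand-in for the pre-trained MLP block."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def upcycle(pretrained_weights, num_experts):
    """Initialize every expert as an independent copy of the pre-trained MLP."""
    return [copy.deepcopy(pretrained_weights) for _ in range(num_experts)]


def top_k_gate(router_logits, k):
    """Keep the k largest logits and apply a softmax over those only."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)  # subtract for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(experts, router_weights, x, k=2):
    """Route x to the top-k experts and return the gate-weighted sum of outputs."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    gates = top_k_gate(logits, k)
    out = [0.0] * len(experts[0])
    for idx, gate in gates.items():
        y = mlp_forward(experts[idx], x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, gates


# Hypothetical toy weights: 2 inputs, 2 outputs, 4 experts.
pretrained = [[0.5, -1.0], [1.0, 2.0]]
experts = upcycle(pretrained, num_experts=4)
router = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
x = [1.0, 2.0]
moe_out, gates = moe_forward(experts, router, x, k=2)
dense_out = mlp_forward(pretrained, x)
```

Because every expert starts as a copy of the same MLP and the gate weights sum to one, the MoE output in this sketch matches the dense MLP output right after initialization; the experts only diverge once training updates them separately.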
VIGOR: Cross-View Image Geo-localization beyond One-to-one Retrieval

https://doi.org/10.1109/CVPR46437.2021.00364

Zhu, Sijie; Yang, Taojiannan; Chen, Chen (June 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))

Full Text Available
3D Human Pose Estimation with Spatial and Temporal Transformers

https://doi.org/10.1109/ICCV48922.2021.01145

Zheng, Ce; Zhu, Sijie; Mendieta, Matias; Yang, Taojiannan; Chen, Chen; Ding, Zhengming (October 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV))

Full Text Available
Visual Explanation for Deep Metric Learning

https://doi.org/10.1109/TIP.2021.3107214

Zhu, Sijie; Yang, Taojiannan; Chen, Chen (January 2021, IEEE Transactions on Image Processing)
Full Text Available
MutualNet: Adaptive ConvNet via Mutual Learning from Different Model Configurations

https://doi.org/10.1109/TPAMI.2021.3138389

Yang, Taojiannan; Zhu, Sijie; Mendieta, Matias; Wang, Pu; Balakrishnan, Ravikumar; Lee, Minwoo; Han, Tao; Shah, Mubarak; Chen, Chen (January 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence)

Full Text Available
Density Map Guided Object Detection in Aerial Images

Li, Changlin; Yang, Taojiannan; Zhu, Sijie; Chen, Chen; Guan, Shanyue (January 2020, IEEE Conference on Computer Vision and Pattern Recognition Workshop)

Object detection in high-resolution aerial images is a challenging task because of 1) the large variation in object size, and 2) the non-uniform distribution of objects. A common solution is to divide the large aerial image into small (uniform) crops and then apply object detection on each small crop. In this paper, we investigate the image cropping strategy to address these challenges. Specifically, we propose a Density-Map guided object detection Network (DMNet), which is inspired by the observation that the object density map of an image presents how objects distribute in terms of the pixel intensity of the map. As pixel intensity varies, it is able to tell whether a region has objects or not, which in turn provides guidance for cropping images statistically. DMNet has three key components: a density map generation module, an image cropping module, and an object detector. DMNet generates a density map and learns scale information based on density intensities to form cropping regions. Extensive experiments show that DMNet achieves state-of-the-art performance on two popular aerial image datasets, i.e., VisionDrone and UAVDT.
Full Text Available
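
The idea in the DMNet abstract that density intensity indicates where objects are, and can therefore guide cropping, can be sketched as follows. This is a simplified illustration under stated assumptions, not the DMNet cropping module: it thresholds a toy density grid and takes the bounding box of each 4-connected above-threshold region. The function name `crop_regions`, the threshold value, and the grid are hypothetical.

```python
from collections import deque


def crop_regions(density, threshold):
    """Return inclusive (top, left, bottom, right) boxes, one per
    4-connected region whose density is at or above the threshold."""
    rows, cols = len(density), len(density[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or density[r][c] < threshold:
                continue
            # Breadth-first search over one dense region.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and density[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Hypothetical 4x4 density map with two dense clusters.
density_map = [
    [0.9, 0.8, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.5, 0.9],
]
boxes = crop_regions(density_map, threshold=0.5)
```

Each returned box marks a candidate crop that a detector could process at higher effective resolution, while empty background regions are skipped.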
MutualNet: Adaptive ConvNet via Mutual Learning from Network Width and Resolution

Yang, Taojiannan; Zhu, Sijie; Chen, Chen; Yan, Shen; Zhang, Mi; Willis, Andrew (January 2020, European Conference on Computer Vision)

We propose the width-resolution mutual learning method (MutualNet) to train a network that is executable under dynamic resource constraints to achieve adaptive accuracy-efficiency trade-offs at runtime. Our method trains a cohort of sub-networks with different widths (i.e., number of channels in a layer) using different input resolutions to mutually learn multi-scale representations for each sub-network. It achieves consistently better ImageNet top-1 accuracy over the state-of-the-art adaptive network US-Net under different computation constraints, and outperforms the best compound scaled MobileNet in EfficientNet by 1.5%. The superiority of our method is also validated on COCO object detection and instance segmentation as well as transfer learning. Surprisingly, the training strategy of MutualNet can also boost the performance of a single network, which substantially outperforms the powerful AutoAugmentation in both efficiency (GPU search hours: 15000 vs. 0) and accuracy (ImageNet: 77.6% vs. 78.6%). Code is available at https://github.com/taoyang1122/MutualNet.
Full Text Available
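
The MutualNet abstract describes training sub-networks of different widths on different input resolutions. A minimal sketch of how such (width, resolution) pairs might be drawn for one training iteration is shown here. The width range, the resolution list, the number of random sub-networks, and the function name `sample_configs` are all assumptions for illustration; the abstract does not specify them, and the actual training and loss computation are omitted.

```python
import random


def sample_configs(width_range=(0.25, 1.0),
                   resolutions=(224, 192, 160, 128),
                   num_random=2,
                   rng=None):
    """Draw (width multiplier, input resolution) pairs for one iteration.

    Assumed scheme: the full-width network sees the largest resolution,
    while the narrowest and a few random-width sub-networks each get a
    randomly chosen resolution.
    """
    rng = rng or random.Random()
    lo, hi = width_range
    configs = [(hi, max(resolutions))]             # full network
    configs.append((lo, rng.choice(resolutions)))  # narrowest sub-network
    for _ in range(num_random):
        width = round(rng.uniform(lo, hi), 2)
        configs.append((width, rng.choice(resolutions)))
    return configs


configs = sample_configs(rng=random.Random(0))
for width, resolution in configs:
    print(f"width x{width:.2f} at {resolution}x{resolution} input")
```

Pairing narrower sub-networks with varied resolutions is what lets each sub-network see multi-scale inputs during training, which is the mutual-learning effect the abstract refers to.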